Gene-culture coevolution (GC-coev) is one of the main challenge of evolutionary archaeology. It requires the cross study of aDNA (‘Who ?’) and the different aspects of the culture (‘What ?’).
This doc is a HTML presentation host on GitHub, which brings together R + Leaflet coding showing how to reuse mtDNA data and perform cross-analysis between gene, culture and radiocarbon dates. This overview focus on the the transition between hunter-gatherers (HG) and early farmers (EF) in the Central and West Europe
The mitochondrial DNA (mtDNA) makes it possible to trace the maternal line. It passes from the mother to her children (of both sexes). Published mitochondrial sequences coming from the ancient DNA samples (aDNA) can be download from the Ancient mtDNA database (AmtDB). This database gathers . The whole dataset is composed by the data and the metadata
The data file mtdb_1511.fasta is downloaded from the AmtDB. The current metadata format is .fasta, a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes. FASTA formats can be read with the phylotools package. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data:
fasta <- read.fasta("mtdb_1511.fasta")
fasta$seq.name[1]
## [1] "RISE509"
substr(fasta$seq.text[1], 1, 250)
## [1] "GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATCCTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACTTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT"
Each of these identifiers is a unique key for a sample. The data file counts 1511 different identifiers. These identifiers which allow to join the second dataset, metadata for cross-analysis
The metadata file metadata is a .csv file recording the site name, culture, epoch, geographical coordinates, etc.:
| identifier | alternative_identifiers | country | continent | region | culture | epoch | group | comment | latitude | longitude | sex | site | site_detail | mt_hg | ychr_hg | year_from | year_to | date_detail | bp | c14_lab_code | reference_names | reference_links | reference_data_links | c14_sample_tag | c14_layer_tag | avg_coverage | sequence_source | ychr_snps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OBKR_9B | 9 SK B | Germany | Europe | central Europe | Early Bronze Age | Bronze Age | EBAGe | 48.26652 | 10.878157 | F | Königsbrunn, Obere Kreuzstraße | U5a1i1 | -2032 | -1903 | cal BC 2032-1903 | 3616±23 | MAMS-18892 | Knipper et al. 2017 | https://doi.org/10.1073/pnas.1706355114 | https://www.ncbi.nlm.nih.gov/popset/1240339410 | 1 | 0 | 160.000 | reconstructed | ||||
| I4073 | skeleton 236-M13 | The Netherlands | Europe | western Europe | Bell Beaker | Copper Age | BBC | 52.73356 | 5.096183 | M | De Tuithoorn, Oostwoud, Noord-Holland | U5a2b3 | R1b1a1a2a1a2 | -2195 | -1905 | 2195-1905 calBCE (3660±50 BP, GrA-15598) | 3660±50 | GrA-15598 | Olalde et al. 2018 | https://dx.doi.org/10.1038/nature25738 | http://www.ebi.ac.uk/ena/data/view/PRJEB23635 | 1 | 0 | 238.708 | bam | |||
| I5427 | A2055 | Greece | Europe | southern Europe | Greece_Peloponnese_Neolithic | Neolithic | NEGr | 36.64000 | 22.380000 | F | Diros, Alepotrypa Cave | K1a24 | -6005 | -5879 | 6005-5879 calBCE (7050±30 BP, PSUAMS-2682) | 7050±30 | PSUAMS-2682 | Mathieson et al. 2018 | https://dx.doi.org/10.1038/nature25778 | https://www.ebi.ac.uk/ena/data/view/PRJEB22652 | 1 | 0 | 0.000 | bam |
It counts 2426 samples and 29 columns.
Within the AmtDB database, one (1) field concerns the mtDNA hg (‘mt_hg’) and two (2) other fields give insights on the cultural membership of the sample:
To illustrate GC-coev, we will focus on the culture tags (i.e., ‘culture’ column) of the Mesolithic/Neolithic transition (i.e., ‘epoch’ == ‘Mesolithic’ or ‘Neolithic’). Our region of interest is the Central and Western Mediterranean (i.e., from Greece to Spain). Samples with empty values in ‘mt_hg’ are removed. To have an easily visibly of the dataset we remove also the little represented cultures (Freq == 1)
df.MesoNeo <- df %>%
filter(mt_hg != "" & !is.na(mt_hg)) %>%
filter(epoch == 'Mesolithic' | epoch == 'Neolithic') %>%
filter(latitude < 46 & latitude > 35) %>%
filter(longitude > -6 & longitude < 28)
df.MesoNeo <- df.MesoNeo %>%
group_by(culture) %>%
filter(n() > 1)
MNcultures <- as.data.frame(table(df.MesoNeo$culture),
stringsAsFactors = F)
After processing, the new dataset counts 144 samples and 17 distinct cultures:
Figure 1: Counts of mtDNA samples for the main cultures
The selected mtDNA samples can be mapped. R and Leaflet allow to create interactive web maps and many ways to manage maps’ layout (colors, hyperlinks, etc.): mtDNA samples have the same color as the pie chart (Figure 1). Inside their popup (column ‘lbl’), the hg appears always colored in red, culture color depends on the culture, the mtDNA identifier and the bibliographical reference are hyperlinks
df.MesoNeo.a <- df.MesoNeo
df.MesoNeo.a <- merge(df.MesoNeo.a, df.colors, by.x="culture", by.y="Var1", all.x=T)
df.MesoNeo.a$lbl <- paste0("<b>", df.MesoNeo.a$site, "</b> <br>",
"hg: <b> <span style='color:red'>",
df.MesoNeo.a$mt_hg, "</span> </b>, ",
'identifier: <a href=',
shQuote(paste0(amtdb,df.MesoNeo.a[,'identifier'])),
"\ target=\"_blank\"", ">",
df.MesoNeo.a[,'identifier'], "</a> <br>",
"culture: <span style='color:",df.MesoNeo.a$color,";'>",
df.MesoNeo.a$culture, "</span> <i>",
df.MesoNeo.a$epoch, "</i><br>",
'ref: <a href=',
shQuote(paste0(df.MesoNeo.a[,'reference_links'])),
"\ target=\"_blank\"",
">", df.MesoNeo.a[,'reference_names'], "</a>")
leaflet(data = df.MesoNeo.a, width = "90%", height = "500px") %>%
addTiles(group = 'OSM') %>%
addCircleMarkers(layerId = ~identifier,
lng = ~longitude,
lat = ~latitude,
weight = 1,
radius = 3,
popup = ~lbl,
color = ~color,
opacity = 0.7,
fillOpacity = 0.7) %>%
addLegend("bottomleft",
colors = unique(df.MesoNeo.a$color),
labels= unique(df.MesoNeo.a$culture),
title = "cultures",
opacity = 1)
Figure 2: Sample and cleaned dataset
Graph drawing is a well-known heuristic to model relationships between discrete entities. With the R packages igraph and plotly, we can trace an enriched interactive graph to model the dataset structure. The default layout is a force-directed algorithm (e.g. Fruchterman-Reingold layout) allowing to bring closer or move away vertices depending on the edges they share. The graph is a multi-partite one with three (3) classes of vertices: cultures, haplogroups and identifiers. The mtDNA identifiers (leafs of the graph) are URLs hyperlinks pointing to the same records in the AmtDB database
Figure 3: AmtBD dataset structure modeled with graph
To go further with the GC-coev analysis, we can focus on co-occurences’ between hg and cultures. Here, we simply use the igraph package. The new graph is a bipartite one (i.e, 2-mode graph, two classes of vertices)
par(mar = c(0, 0, 0, 0))
edges <- df.MesoNeo.a[, c("culture", "mt_hg")]
nds.culture <- as.character(data$Var1)
nds.culture.color <- as.character(data$color)
nds.mt_hg<- unique(df.MesoNeo.a$mt_hg)
nds.mt_hg.color <- rep("grey", length(nds.mt_hg))
thm.ths.nds <- data.frame(name=c(nds.culture, nds.mt_hg),
color=c(nds.culture.color, nds.mt_hg.color),
stringsAsFactors = F)
g <- graph_from_data_frame(edges,
directed=F,
vertices= thm.ths.nds)
g <- simplify(g)
# is_simple(g)
set.seed(123)
laout <- layout_with_fr(g)
plot(g, vertex.label.cex= 0.7, layout = laout,
vertex.frame.color=NA,
vertex.label.color = "black",
vertex.label.family="Helvetica",
vertex.size=9)
Figure 4: Bipartite network of cultures and haplogroups (hg)
A first reading of the graph shows that:
The dataset variability can be reduced with agglomeration techniques like the the community detection algorithm
To to detect communities of each vertex, one can uses the ‘fast greedy’ algorithm (fastgreedy.community from the igraph package. This algorithm is a hierarchical ranking algorithm where initially each vertex belongs to a distinct community, and the communities are merged iteratively, so that each merge is locally optimal. The algorithm stops when it is no longer possible to increase the modularity, it will be thus gives a grouping as well as a dendrogram. This algorithm is close to the agglomeration of Ward used in hierarchical clustering (HC)
Figure 5: Community detetction on the bipartite network of cultures and haplogroups (hg)
A first reading of the clusters show a clear separation (i.e. edge distance) between three groups:
Surely, this GC-coev attempt needs to be refined. The cultural membership of mtDNA sample can be discussed upstream specifying the cultures they belong to, or downstream, choosing a different community detection algorithm, etc.
With their unique LabCode, the dates associated with mtDNA samples can be easily be connected to existing databases. We will use the metadata .csv file mtdb_metadata.csv to join the sample identifiers with the radiocarbon dates. These latter have a unique laboratory number (LabCode). This permit to associate the mtDNA sample with the radiocarbon date.
The dataset needs to be cleaned keeping interesting fields and removing samples with typo errors (like longitude < - 90). For this training, the selected samples are those associated to radiocarbon dates: empty radiocarbon values and empty LabCode are removed. The radiocarbon field ‘bp’ is not strictly formatted, a text edition has to be done with regular expressions (regex) by splitting the values from ‘bp’ field (split on ‘±’) to copy them in two: columns BP and SD. Finally, the analysis will still focus on Neolithic in Spain
selected.fields <- c("identifier", "site", "country", "culture", "epoch", "mt_hg", "c14_lab_code", "bp", "longitude", "latitude", "reference_names", "reference_links")
df <- read.csv(metadata, encoding = "UTF-8")
df <- df[, selected.fields]
df <- df %>%
filter(mt_hg != "" & !is.na(mt_hg)) %>%
filter(bp != "" & !is.na(bp)) %>%
filter(c14_lab_code != "" & !is.na(c14_lab_code)) %>%
filter(str_detect(bp, "±")) %>%
filter(!str_detect(bp, "BP|and")) %>%
filter(country == "Spain") %>%
filter(epoch == 'Neolithic')
bps <- unlist(sapply(df$bp,
function(x) strsplit(x, "±")),
use.names = FALSE)
mat <- matrix(bps, ncol=2, byrow=TRUE)
df$c14bp <- as.integer(mat[, 1])
df$c14sd <- as.integer(mat[, 2])
df$bp <- NULL
Now the new dataset has 10 samples and 13 columns:
| identifier | site | country | culture | epoch | mt_hg | c14_lab_code | longitude | latitude | reference_names | reference_links | c14bp | c14sd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I0410 | Els Trocs | Spain | Iberia_EN | Neolithic | T2c1d or T2c1d2 | MAMS-16161 | 0.500000 | 42.5000 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 | 6217 | 25 |
| CB13 | Cova Bonica, Vallirana, Barcelona | Spain | Iberia_EN | Neolithic | K1a2a | Beta-384724 | 1.890000 | 41.3700 | Olalde et al. 2015 | https://doi.org/10.1093/molbev/msv181 | 6410 | 30 |
| I0412 | Els Trocs | Spain | Iberia_EN | Neolithic | N1a1a1 | MAMS-16164 | 0.500000 | 42.5000 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 | 6249 | 28 |
| I0413 | Els Trocs | Spain | Iberia_EN | Neolithic | V | MAMS-16166 | 0.500000 | 42.5000 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 | 6234 | 28 |
| I1972 | El Prado de Pancorbo, Burgos | Spain | Iberia_EN | Neolithic | K1a4a1 | Beta-366569 | -3.116490 | 42.6282 | Lipson et al. 2017 | https://dx.doi.org/10.1038/nature24476 | 5880 | 30 |
| Val05 | Valdavara 1, Becerrea, Lugo | Spain | Iberian_MN | Neolithic | H1ay | Beta-235729, Beta-235730 | -7.231459 | 42.8571 | Gonzalez-Fortes et al. 2019 | https://doi.org/10.1098/rspb.2018.2288 | 4490 | 40 |
The dataset can be spatialized
Figure 6: mtDNA samples for the Neolithic of Spain
R offers packages to calibrate dates like Bchron or rcarbon. Using this latter, we plot the dates and add the identifiers of mtDNA samples for each summed probability distributions
idfs <- paste0(df.MesoNeo.b$c14_lab_code, "\n", df.MesoNeo.b$identifier)
c14dates <- calibrate(x = df.MesoNeo.b$c14bp,
errors = df.MesoNeo.b$c14sd,
calCurves = 'intcal20',
ids = idfs,
verbose = FALSE)
multiplot(c14dates,decreasing = TRUE,
rescale = TRUE,
HPD = TRUE,
col.fill = 'lightgrey')
Figure 7: Calibration of radiocarbon dates associated with mtDNA data for the Neolithic of Spain
If one wants to study contemporaneous mtDNA samples, a statistical test has be performed, like a Mann-Whitney (function wilcox.test) on the ageGrid (results of the function BchronCalibrate) for each radiocarbon date pairwise. When for a pairwise the p-value is superior to 0.05 (95%), the two dates are considered contemporaneous
l <- list()
for(i in 1:nrow(df.MesoNeo.b)){
ids <- df.MesoNeo.b[i, "identifier"]
a.date <- BchronCalibrate(ages=df.MesoNeo.b[i, "c14bp"],
ageSds=df.MesoNeo.b[i, "c14sd"],
calCurves='intcal20',
ids=df.MesoNeo.b[i, "identifier"])
lst <- list(eval(parse(text=paste0("a.date$",ids,"$ageGrid"))))
names(lst) <- ids
l[length(l)+1] <- lst
}
n.obs <- sapply(l, length)
seq.max <- seq_len(max(n.obs))
mat <- t(sapply(l, "[", i = seq.max))
rownames(mat) <- df.MesoNeo.b[, "identifier"]
mat.t <- t(mat)
n.col <- 1:length(rownames(mat))
lcompar <- t(combn(n.col, 2))
df.test <- data.frame(a = character(0),
b = character(0),
pval = integer(0))
for(i in 1:nrow(lcompar)){
p1 <- lcompar[i, 1]
p2 <- lcompar[i, 2]
a <- colnames(mat.t)[p1]
b <- colnames(mat.t)[p2]
pval <- round(wilcox.test(mat.t[, p1], mat.t[, p2])$p.value, 3)
df.test <- rbind(df.test, c(a, b, pval))
df.test <- rbind(df.test, c(b, a, pval))
}
colnames(df.test) <- c("Var1", "Var2", "value")
df.test$value <- as.numeric(df.test$value)
ggplot(data = df.test, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(na.rm = T) +
scale_fill_gradient2(low = "yellow", high = "red", space = "Lab",
na.value = 'white')+
geom_text(aes(Var1, Var2, label = value),
color = "black", size = 2.5)+
theme_minimal()+
theme_bw()+
theme(axis.title = element_blank())+
theme(axis.text=element_text(size=7))+
theme(axis.ticks = element_blank())+
theme(axis.ticks.length=unit(.1, "cm"))+
theme(legend.position = "none")
Figure 8: Square matrix of Mann-Whitney test results on grid ages extents for all pairwise of radiocarbon dates
Only two mtDNA/radiocarbon dates have homogenized grid ages extents and can be considerated has contemporaneous, I0413 and I0410 from the Els Trocs cave (Huesca, Spain) (Haak et al. 2015).
| site | country | culture | identifier | mt_hg | c14_lab_code | c14bp | c14sd | reference_names | reference_links |
|---|---|---|---|---|---|---|---|---|---|
| Els Trocs | Spain | Iberia_EN | I0413 | V | MAMS-16166 | 6234 | 28 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 |
| Els Trocs | Spain | Iberia_EN | I0410 | T2c1d or T2c1d2 | MAMS-16161 | 6217 | 25 | Haak et al. 2015 | https://dx.doi.org/10.1038/nature14317 |
The time period covered by the Late Mesolithic-Early Neolithic (LM/EN) transition to the Late Neolithic (LN) represents three major demic diffusions:
Between these major population diffusions, smaller demic changes occurred. LM societies (ca. 7000-6000 BC) are considered to form the background of the early farmers (EF) arrivals. LM societies are composed by hunter-gatherers (HG) with a low net reproduction rate (Diamond 1998), generally a semi-sedentary residence (i.e. foragers), living in small groups with few storage capacities (Bradley 2003) and long-distance exchanges are sparse (Karamitrou-Mentessidi et al. 2013). Direct contacts between these HG and EF, or close settlements, are very few. In Eastern and Northern Europe, despite various material exchanges, few admixtures between EF farmers and indigenous HG have been observed: the hg U5 is typical of HG and the hg N1a is typical of EF. Mesolithics could have backed up to West Europe, the westernmost part of the Eurasian continent, and admixture could be more frequent in the West. In the middle Rhine valley, last HG populations have probably played a role in the Michelsberg cultural affirmation (Beau et al. 2017) and such interrogations also exist for the development of Megalithism along the Atlantic shore (Vincent and Emmanuel 2018).
Indeed, during this period and from Greece to Spain, we observed an absence of occupations (the so-called "hiatus" or "gap"). This "hiatus" is maybe due to a major mobility at the end of the 7th millenium BC linked to a major climatic event (8.2 Ky BP event) with drying and cooling in Europe (Guilaine 2011). The earliest Neolithic culture in Europe concerns the South-eastern part of the continent occupied by the PPC, ca 6500-6000 BC. After PPC diffusion in Greece, Thrace and the Balkan peninsula, a first stop in the progression is observed on its western and northern frontiers, respectively in western Greece with the PPC/ICC transition, and in the western Hungarian with the PPC/LBK transition.
After that, the two main currents for the beginning of farming in Central and Western Europe are the ICC (ca 5800-4900 BC) in the Mediterranean area, and the LBK (ca 5600-4900 BC) in the continent area. These currents where relatively fast even if other delays can also be observed in the diffusion of farming (Guilaine 2000-2001).
ICC current of farming is the fastest but shows spatial discontinuities (leapfrog movements) (Binder et al. 2017). The first phase of ICC, Impressa pioneer front, is short (ca 5800-5600 BC). The material culture of Impressa (e.g. lithics, subsistence system) shows significant differences with the Mesolithic indigenous. But contacts between Mesolithics and Early Neolithics exist: Impressa layers of Arene Candide (Finale Ligure, Liguria) show flints coming from a region controlled by late Mesolithic groups (Monti Lessini in Appenins)(Binder and Maggi 2001). The second phase of ICC, Cardial (ca 5600-5000 BC), shows more connections and admixture with the Mesolithic way of living with, for example, a part of the hunting in the subsistence system and adoption of lithic tools. This latter phase could be interpreted as an acculturation phase between Neolithics and Mesolithics.
LBK current of neolithisation shows a homogeneous culture (settlement, ceramic and lithic productions, subsistence system) recognized for a long time by archaeologists thanks to a favorable taphonomy of the settlements. As said, its diffusion seems to correspond to the IDD model where the demic factor is the most important one. Few genetic admixture has been recognized between Recent LBK farmers and latest hunter-gatherers even in Western Europe (Rivollat et al. 2016).
In Western Europe, first evidence of contacts between Recent LBK and Late Cardial (ca. 5200-4700 BC) takes place on both sides of an almost East-West middle-ground, extended from the middle part of the Rhone valley to the north of the Aquitaine basin. The Middle Neolithic (4500-3500 BC), except for the margins of Europe, is the achievement period of the farming diffusion process. This is a period of relative homogeneity and long-distance trades (green-stones, obsidian, bedoulian flint, etc.). It is not clear if this networks of distance trades were also the place of significant demic diffusion or if they were only traits exchanges. Megalithism starts as a response to a demographic stress (Hamon and Manen 2018). The mining phenomena also starts (e.g. in Chassean and Michelsberg contexts). Mining activities were probably the most complex activities of the period, they requires a complex organization all along the CO. At the westernmost part of the continent, the English Channel between the agro-pastoral continent and the British Islands, is crosses around 4000 BC by early farmers (Collard et al. 2010).